At the beginning, a few immediately necessary packages will loaded. Other packages will be loaded thoughout the document, when they are needed and applied.
library(data.table) # package to handle datatables
library(plotly) # package for interactivity of plots
library(rstatix) # package for piping and basical statistical tests
# Setting the document colour unigecol and loading the dataset
unigecol = "#D20D63" # setting the colour that will be used for most visualisations (where applicable) to maintain a uniform impression
setwd("D:/Dokumente/Studium/Master/Université de Genève/Kurse/Creating Value Through Data Mining") # set working directory
cereals <- fread("Cereals.csv") # read CSV-file
The first step in this analysis will be to get an overview about the data in the dataset.
summary(cereals) # basic command to get a summary of the datatset
## name mfr type calories
## Length:77 Length:77 Length:77 Min. : 50.0
## Class :character Class :character Class :character 1st Qu.:100.0
## Mode :character Mode :character Mode :character Median :110.0
## Mean :106.9
## 3rd Qu.:110.0
## Max. :160.0
##
## protein fat sodium fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
##
## carbo sugars potass vitamins
## Min. : 5.0 Min. : 0.000 Min. : 15.00 Min. : 0.00
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 42.50 1st Qu.: 25.00
## Median :14.5 Median : 7.000 Median : 90.00 Median : 25.00
## Mean :14.8 Mean : 7.026 Mean : 98.67 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## NA's :1 NA's :1 NA's :2
## shelf weight cups rating
## Min. :1.000 Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :2.000 Median :1.00 Median :0.750 Median :40.40
## Mean :2.208 Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :3.000 Max. :1.50 Max. :1.500 Max. :93.70
##
The dataset contains 16 variables and 77 rows of observations.
The main contents are one observation for each sort of cereal and
corresponding characteristics, i.e. nutrition values. Also, we can find
a variable shelf and rating. Most variables are numeric.
Missing values are only in carbo and sugar, 1 each and 2 NAs in potass.
These rows will not be removed in advance but whenever they become a
problem to keep the maximum number of rows for each analysis.
name: nominal
mfr: nominal
type: nominal
calories: numeric
protein: numeric
fat: numeric
sodium: numeric
fibre: numeric
carbo: numeric
sugar: numeric
potass: numeric
vitamins: numeric
shelf: ordinal
weight: numeric
cups: numeric
rating: numeric
This can be done through R’s sapply (e.g., sapply(data, mean, na.rm = TRUE)). No further comment will be made on the particular outcomes.
# using sapply for calculating the descriptive measures, removing NAs
sapply(cereals[,c(4:12,14:16)], mean, na.rm = TRUE)
## calories protein fat sodium fiber carbo sugars
## 106.883117 2.545455 1.012987 159.675325 2.151948 14.802632 7.026316
## potass vitamins weight cups rating
## 98.666667 28.246753 1.029610 0.821039 42.665705
sapply(cereals[,c(4:12,14:16)], min, na.rm = TRUE)
## calories protein fat sodium fiber carbo sugars potass
## 50.00000 1.00000 0.00000 0.00000 0.00000 5.00000 0.00000 15.00000
## vitamins weight cups rating
## 0.00000 0.50000 0.25000 18.04285
sapply(cereals[,c(4:12,14:16)], max, na.rm = TRUE)
## calories protein fat sodium fiber carbo sugars potass
## 160.00000 6.00000 5.00000 320.00000 14.00000 23.00000 15.00000 330.00000
## vitamins weight cups rating
## 100.00000 1.50000 1.50000 93.70491
sapply(cereals[,c(4:12,14:16)], sd, na.rm = TRUE)
## calories protein fat sodium fiber carbo sugars
## 19.4841191 1.0947897 1.0064726 83.8322952 2.3833640 3.9073256 4.3786564
## potass vitamins weight cups rating
## 70.4106360 22.3425225 0.1504768 0.2327161 14.0472887
ggplotly(
ggplot(gather(cereals[,c(4:12,14:16)]),aes(value)) + # ggplot with data from the quantitative variables of cereals
geom_histogram(bins = 10, fill = unigecol, alpha = 0.6) + # optics of the histograms
facet_wrap(~key,scales="free_x") + # make x-axis loose
theme(axis.ticks.x=element_blank(), # no ticks
axis.ticks.y=element_blank(), # no ticks
panel.spacing = unit(2, "lines") #spacing between the panels
) +
xlab("Values") + # x-label
ylab("Count") + # y-label
ggtitle("Quantitative variables - overview")
)
These 12 histograms show the distribution of the numeric variables. Some
seem to be approximately normal distributed, e.g. calories, carbo and
rating. Others like sugar rather seem to be equally distributed but
there are various forms of other distributions as well. At the x-axis,
one can see that the data are not normalised since all variables don’t
have a mean of 0 and that all variables range in different areas and
intervals. Sodium for example ranges between 0 and more than 300 while
cups does so between 0.4 and 1.6.
The highest standard deviations (and therefore variability) can be
found at sodium (83.83) and potass (70.41).
Rating, potass, fibre, fat, vitamins and protein seem to be
skewed.
To get a precise impression which variables are indeed skewed, the
skewness values will be computed as follows:
library(e1071) #package for skewness
sapply(cereals[,c(4:12,14:16)], skewness, na.rm = TRUE) # calculating skewness
## calories protein fat sodium fiber carbo
## -0.42820348 0.71702319 1.12095401 -0.55347524 2.33775467 0.10831521
## sugars potass vitamins weight cups rating
## 0.04270638 1.34483857 2.36854640 0.29788885 -0.10092594 0.87508345
Assuming a fairly symmetrical distribution for values ∈ [-0.5,0.5], a
moderate skewness for values ∈ [-1,-0.5) and values ∈ (0.5,1] and strong
skewness for values (-∞, -1) and values ∈ (1,∞), the following can be
stated:
not skewed: calories, carbo, sugars, weight, cups
moderately skewed: protein, sodium, rating
strongly skewed: fat, fiber, potass, vitamins
By the previous seen visualisations, one can state following two
observations:
As mentioned before, fat, fiber, potass and vitamins are strongly
skewed.
The distributions of weight and and vitamins seem to be central.
But to get real insight into outliers, it will be necessary to use
boxplots of the quantitative variables. This will be done in the
following abstract.
library(gridExtra)
plot1 <- ggplot(cereals, aes(y=cereals$calories)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Calories") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot2 <- ggplot(cereals, aes(y=cereals$carbo)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Carbo") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot3 <- ggplot(cereals, aes(y=cereals$cups)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Cups") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot4 <- ggplot(cereals, aes(y=cereals$fat)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Fat") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot5 <- ggplot(cereals, aes(y=cereals$fiber)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Fiber") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot6 <- ggplot(cereals, aes(y=cereals$potass)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Potass") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot7 <- ggplot(cereals, aes(y=cereals$protein)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Protein") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot8 <- ggplot(cereals, aes(y=cereals$rating)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Rating") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot9 <- ggplot(cereals, aes(y=cereals$sodium)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Sodium") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot10 <- ggplot(cereals, aes(y=cereals$sugars)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Sugars") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot11 <- ggplot(cereals, aes(y=cereals$vitamins)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Vitamins") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
plot12 <- ggplot(cereals, aes(y=cereals$weight)) +
geom_boxplot(fill = unigecol, outlier.colour = "black", outlier.size = 3) +
labs(title = "Weight") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank())
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, plot9, plot10, plot11, plot12, ncol=3)
As mentioned above, weight seems to be very centralised, so everything
that is not approximately equal to 1 appears to be outlying. A similar
picture is drawn for vitamins, just with a value of about 25. For cups,
fiber, potass, rating and protein, there are a few outliers, all above
the 75%-quantile. Only calories has outliers in a significant number and
in both directions. (beside weight and vitamins)
ggplotly(
ggplot(cereals, aes(x = cereals$type, y = cereals$calories)) + #ggplot with type on the x-axis and calories on the y-axis
geom_boxplot(fill = c("blue","red"), color = "black") + # optics
ggtitle("Calories of cold and hot cereals") + #title
labs(x = "Cold and hot cereals", y = "Calories") # axis-labels
)
The right plot for cold cereals tells that the amount of calories
varies mostly (75 %) between 100 and 110 calories. A few outliers in
both directions can be detected.
The plot on the left is not very informative due to a lack of data.
There are only three hot cereals in the dataset and these have exactly
100 calories each, see the following table.
cereals[type == "H", c("name", "type", "calories")] # picking the names, types (just for visualisation) and calories of hot cereals
## name type calories
## 1: Cream_of_Wheat_(Quick) H 100
## 2: Maypo H 100
## 3: Quaker_Oatmeal H 100
ggplotly(
ggplot(cereals, aes(x = as.factor(cereals$shelf), y = cereals$rating)) + # ggplot with shelf as a factor on the x-axis and the rating on the y-axis
geom_boxplot(fill = unigecol, color = "black") + # optics
ggtitle("Rating in accordance to shelf type") + # title
labs(x = "Shelf types", y = "Rating") # labels
)
As one can see, the outer plot of shelf height 1 and 3 are quite similar
except the median that is slightly higher for shelf height 1. If
predicting consumer rating from the information shelf height = 1 or
shelf height = 3, we would get a similar result. One of them could
therefore be abandoned. For shelf height 2, an obvious lower consumer
rating can be observed. This distinction should be kept.
library(reshape2) # needed for the correlation matrix
library(DT) # needed for representing the tables in a proper way
corr_mat <- round(cor(cereals[,c(4:12,14:16)], use ="complete.obs"),2) # new datatable with all numeric variables, using only complete observations
DT::datatable(corr_mat)
library(GGally)
library(ggpubr)
melted_corr_mat <- melt(corr_mat)
ggplotly(
ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill=value)) + geom_tile() + geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) + scale_fill_gradient(low=unigecol, high="white")+ theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) + ggtitle("Correlation matrix for numeric variables") + xlab("") + ylab("")
)
The darker the color appears to either white or red, the stronger is the correlation in the positive respectively negative direction.
# interactivity through plotly has been removed consciously due to the generally crowdedness of the graph and therefore no real additional value through plotly.
ggpairs(
cereals[,c(4:12,14:16)], upper = "blank"
) + labs(title = "Matrix plot for quantitative variables") + rremove("x.axis") +
rremove("xlab") +
rremove("x.text") +
rremove("x.ticks") +
rremove("legend") +
rremove("y.axis") +
rremove("ylab") +
rremove("y.text") +
rremove("y.ticks")
Unfortunately, there are many quantitative variables, so that the plot
appears very crowded. General statements can barely be made.
When 2 variables are correlated, on can easily leave one of them out
to reduce the number of variables. This is due to the situation that the
extra information that one of these variables delivers can also be
derived from the respective correlated variable because one could use
this second variable to predict the first one.
In the following, both matrices, for un-normalised and normalised
data will be (re-)plotted and compared.
cor_mat <- round(cor(cereals[,c(4:12,14:16)], use ="complete.obs"),2)
norm_cor_mat <- round(cor(scale(cereals[,c(4:12,14:16)]), use ="complete.obs"),2) # normalised data so that each variable has a mean value of 0 and unit variance value, only complete observations, rounded on the second digit
melt_cor_mat <- melt(cor_mat)
melt_norm_cor_mat <- melt(norm_cor_mat)
# plotting the un-normalised data
ggplotly(
ggplot(data = melt_cor_mat, aes(x=Var1, y = Var2, fill=value)) + geom_tile() + geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) + scale_fill_gradient(low=unigecol, high="white")+ theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) + ggtitle("Correlation matrix for numeric variables") + xlab("") + ylab("")
)
# plotting the normalised data
ggplotly(
ggplot(data = melt_norm_cor_mat, aes(x=Var1, y = Var2, fill=value)) + geom_tile() + geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) + scale_fill_gradient(low=unigecol, high="white")+ theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) + ggtitle("Correlation matrix for numeric variables") + xlab("") + ylab("")
)
The plots look the same but to be sure not to overlook differences, the
following table will compare the single values.
DT::datatable(cor_mat == norm_cor_mat)
The result did not change. Standardisation has no effect on the
correlation matrix.